PRINCIPAL INVESTIGATOR: Sumali Conlon (e-mail: sconlon@bus.olemiss.edu)

INSTITUTION:  University  of  Mississippi 

GRANT  TITLE:  Automatic  web  searching  and  categorizing  using  query  expansion  and  focusing 
AWARD PERIOD: July 1, 2001 - September 30, 2002

OBJECTIVE:  To  build  a  prototype  system  that  improves  precision  and  recall  rates  for  web  search  using 
query  expansion  and  focusing  techniques.  We  use  linguistic  analysis  and  co-occurrence  information  to 
analyze  syntactic  structures  of  the  users’  queries  to  improve  search  results. 

APPROACH: One standard method of improving internet search is through query expansion. The major
query expansion techniques add terms using (i) lexical semantic relations and (ii) relevance feedback. The
lexical semantic relations in WordNet have been used widely as a main lexical resource for approach (i).
Past research results indicate that using WordNet did not significantly improve information retrieval
effectiveness. Our query expansion system also uses WordNet in a query expansion stage. However,
instead of just adding all related terms from WordNet (synonyms, hypernyms, hyponyms, etc.) directly into
the user’s query, our system selects only useful additional terms. This selection process uses syntactic
analysis combined with collocation and co-occurrence information from a large corpus collected from our
domain of interest (information technology).

This  work  requires  several  steps,  including: 

1)  Extracting  noun  and  proper  noun  phrases  from  web  documents 

2)  Collecting  co-occurrence  data  from  the  web  in  the  domain  of  interest. 

3)  Performing  query  expansion  using  information  in  the  lexical  database  WordNet 

4)  Performing  a  focusing  stage  using  co-occurrence  information  and  syntactic  analysis 

5) Submitting the results to the web to retrieve additional web pages using the expanded phrases.

ACCOMPLISHMENTS  (throughout  award  period): 

We  have  completed  many  of  the  tasks  listed  above.  However,  the  work  is  still  ongoing  since  natural 
language  processing  requires  an  extremely  sophisticated  knowledge  base  and  lexicon.  The  following 
describes  each  stage  of  the  work  which  has  been  accomplished  so  far. 


1)  Extracting  noun  and  proper  noun  phrases  from  web  documents 

This  part  benefits  from  the  previous  work  supported  by  ONR.  We  selected  proper  noun  phrases  and  other 
noun  phrases  semi-automatically  from  a  KWIC  (Key  Words  In  Context)  index  file.  The  KWIC  index  file 
was  created  from  data  collected  from  the  web  in  the  information  technology  domain.  This  data  set  allows 
us  to  find  many  proper  noun  phrases.  However,  many  proper  nouns  that  are  not  found  in  the  sources  on  the 
web  were  added  by  hand  from  other  sources.  Acronyms  were  also  collected. 

Noun  phrases,  proper  noun  phrases,  and  acronyms  are  important  in  internet  search  since  most  users’  queries 
are  short  and  are  in  the  form  of  such  expressions.  If  the  queries  are  in  the  form  of  proper  nouns  or 
acronyms,  the  system  identifies  them  from  these  extensive  proper  noun  lexicons.  They  will  not  have  to  be 
expanded.  The  query  “the  office  of  naval  research,”  for  example,  will  not  require  the  expansion  stage. 
This  query  will  be  sent  directly  to  the  search  engines  (Google  in  our  system).  The  returned  URLs  will  be 
the  URLs  that  the  search  engine  provides. 
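The routing decision described above can be sketched as follows. This is only an illustration: the function name, the article-stripping step, and the tiny lexicons are assumptions made for the example, not the project's actual code or data.

```python
def needs_expansion(query, proper_nouns, acronyms):
    """Return False when the query is a known proper noun phrase or acronym."""
    # Normalize case and whitespace before looking the query up.
    normalized = " ".join(query.lower().split())
    # Assumption: a leading article is ignored, so that
    # "the office of naval research" matches "office of naval research".
    for article in ("the ", "a ", "an "):
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break
    return normalized not in proper_nouns and normalized not in acronyms


# Hypothetical lexicon entries, for illustration only.
PROPER_NOUNS = {"office of naval research", "university of mississippi"}
ACRONYMS = {"onr", "ibm"}

for q in ("the office of naval research", "ONR", "fast computer"):
    route = "expand first" if needs_expansion(q, PROPER_NOUNS, ACRONYMS) else "send directly"
    print(q, "->", route)
```

Queries found in either lexicon skip the expansion stage and go straight to the search engine; everything else continues to step 3.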

2)  Collecting  co-occurrence  data  from  the  web  in  the  domain  of  interest. 

The  KWIC  program  produces  co-occurrence  data  that  helps  us  in  the  query  expansion  and  focusing  stage. 
Here  are  some  sample  items  in  this  file: 


No.  w1 w2                  w3             w4          w5 w6
 1.  to the                 Cray-1         computer    which was
 2.  Apple iMac             DV             computers   becomes available
 3.  ast-movingworld        of             computer    technology, the
 4.  immensely powerful     IBM            computer    that experts
 5.  The aster              PC             computer.   128+ MB
 6.  answer sessions        High-speed     computers   Quality reference
 7.  Cafe, Poison, MT;      High-speed     computers   fine coffee.
 8.  answer sessions        High-speed     computers   Quality reference
 9.  for                    high-speed     computers   and electronics.
10.  techniques for         high-speed     computers   have developed
11.  satellites and         high-speed     computers   and }
12.  a high-speed,          high-capacity  computer    containing data
13.  Organizatin:Clientes   Fast           computer    systems visit
14.  Possible Tomorrows     High-speed     computers   have made
15.  answer sessions        High-speed     computers   Quality reference
16.  of a                   powerful       computer    algorithm written
17.  fast and               powerful       computers   with qually
18.  key innovations        Powerful       computer    modeling From
19.  turbidity 22 -         Powerful       computers   developed with
20.  fast and               powerful       computers   by become,
21.  fast and               powerful       computers   they will

Table  1.  Key  Word  In  Context  (KWIC)  display  for  the  sentences  that  contain  the  word  “computer” 


The  actual  data  in  the  KWIC  file  contains  15  words  per  line.  However,  to  make  the  output  more  readable 
in  this  report,  we  only  show  six  words  per  line.  Column  4  contains  the  word  “computer”  while  the  words 
around  this  column  are  words  that  appear  before  or  after  the  word  in  question  in  the  web  documents  we 
collected.  Currently  our  KWIC  file  contains  more  than  ten  million  records. 
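A minimal sketch of how such fixed-width KWIC records could be produced is given below. The function name, the prefix-matching rule, and the window sizes are assumptions for illustration; the report does not describe the actual KWIC program at this level of detail.

```python
def kwic_records(text, keyword, before=3, after=2):
    """Yield one fixed-width record per occurrence of `keyword` in `text`.

    With before=3 and after=2 each record has six words and the keyword
    sits in column 4, as in Table 1. The real file uses 15 words per line.
    """
    keyword = keyword.lower()
    tokens = text.split()
    for i, token in enumerate(tokens):
        # Compare on a stripped, lowercased form so "Computers," still matches.
        core = token.strip(".,;:!?\"'()").lower()
        # Crude prefix match (assumption): it accepts "computers" but would
        # also accept unrelated words such as "computerized".
        if core.startswith(keyword):
            left = tokens[max(0, i - before):i]
            right = tokens[i + 1:i + 1 + after]
            # Pad with empty strings so the keyword column never shifts.
            left = [""] * (before - len(left)) + left
            right = right + [""] * (after - len(right))
            yield left + [token] + right


# Made-up sample sentence, for illustration only.
for record in kwic_records("access to the Cray-1 computer which was retired", "computer"):
    print(record)
```

Padding short contexts keeps every record the same width, which is what makes the later column-based co-occurrence lookups straightforward.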

3)  Performing  standard  query  expansion  using  information  in  the  lexical  database  WordNet 

WordNet  is  a  lexical  database  generated  by  a  team  of  cognitive  scientists  at  Princeton  University.  It  is  the 
most  comprehensive  lexical  database  available  today  and  it  has  been  used  by  most  natural  language 
processing  researchers.  In  this  research,  we  use  entries  in  WordNet  to  perform  query  expansion. 

Queries that are not proper nouns or acronyms may require query expansion. Users might submit queries
that consist of a noun possibly modified by some adjectives (“big screen monitor,” for example). Since
there are many phrases that represent the same concepts, users’ queries may not match the terms that are
used by writers of web pages. The standard query expansion process is intended to fix this problem. The
query “fast computer,” for example, can be expanded based on the synonyms in WordNet, by finding the
cross product of synonyms of each term in the query. In this example, “fast” has 74 synonyms:


accelerated    botonee        fast           hurried        locked         pernickety     scurrying      tinted
accelerating   botonnee       fastened       hurrying       meteoric       persnickety    secured        upright
alacritous     button-down    finical        immediate      meticulous     pinned         smooth         vertical
allegretto     choosey        finicky        immoral        moving         prestissimo    speeding       vivace
allegro        choosy         fixed          instant(a)     nice           presto         speedy         winged
andantino      constant       fleet          instantaneous  old-maidish    prissy         squeamish
asleep(p)      dainty         fussy          invasive       old-womanish   prompt         stapled
barred         double-quick   hastening      jet-propelled  overnice       quick          steady
blistering     dyed           high-speed     knotted        particular     rapid          straightaway
bolted         erect          hot            latched        pegged-down    red-hot        swift

Table  2.  Synonyms  for  “fast” 

Similarly,  the  word  “computer”  has  9  synonyms: 


computer
computing machine
computing device
data processor
electronic computer
information processing system
calculator
reckoner
figurer
estimator


Table  2.  Synonyms  for  “computer” 


The query expansion stage will produce 74 x 10 = 740 phrases (the 10 counts “computer” itself together with its 9 synonyms). Some examples are:


Accelerated  Computer 
Accelerated  computing  machine 
Accelerated  computing  device 
Accelerated  data  processor 
Accelerated  electronic  computer 
Accelerated  information  processing  system 
Accelerated  calculator 
Accelerated  Reckoner 
Accelerated  figurer 
Accelerated  estimator 


Alacritous  Computer 
Alacritous  computing  machine 
Alacritous  computing  device 
Alacritous  data  processor 
Alacritous  electronic  computer 
Alacritous  information  processing  system 
Alacritous  calculator 
Alacritous  reckoner 
Alacritous  figurer 
Alacritous  estimator 


Allegretto  Computer 
Allegretto  computing  machine 
Allegretto  computing  device 
Allegretto  data  processor 
Allegretto  electronic  computer 
Allegretto information processing system
Allegretto  calculator 
Allegretto  reckoner 
Allegretto  figurer 
Allegretto  estimator 


Table  3.  Some  phrases  produced  during  the  query  expansion  stage 
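The cross product behind Table 3 can be sketched in a few lines. The synonym dictionary here is a small hand-copied excerpt from the synonym lists above, and the function name is an assumption; a real run would read the synonym sets from WordNet.

```python
from itertools import product


def expand_query(terms, synonyms):
    """Return every phrase in the cross product of the terms' synonym lists."""
    # A term with no entry in the dictionary is kept unchanged.
    alternatives = [synonyms.get(term, [term]) for term in terms]
    return [" ".join(combo) for combo in product(*alternatives)]


# Excerpt only: three of the synonyms of "fast", and "computer" with its 9 synonyms.
SYNONYMS = {
    "fast": ["accelerated", "alacritous", "allegretto"],
    "computer": [
        "computer", "computing machine", "computing device", "data processor",
        "electronic computer", "information processing system",
        "calculator", "reckoner", "figurer", "estimator",
    ],
}

phrases = expand_query(["fast", "computer"], SYNONYMS)
print(len(phrases))  # 3 x 10 = 30 phrases, the ones listed in Table 3
```

Because the number of phrases is the product of the list sizes, it grows quickly with query length, which is why unfiltered expansion becomes impractical.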

The  expanded  queries  may  help  improve  the  recall  rate  of  a  query  since  the  additional  terms  might  better 
match  words  in  the  web  documents.  However,  most  of  the  additional  terms  are  not  useful.  As  a  result,  if 
the  system  submits  all  of  these  phrases  as  additional  queries,  the  precision  rate  will  drop  tremendously.  In 
addition,  search  will  become  very  slow,  since  each  of  the  hundreds  of  queries  will  take  several  seconds  to 
process. Thus, we must find ways to eliminate some useless phrases. The next step describes how we do
this.


4)  Improving  on  the  standard  method:  focusing  using  co-occurrence  information  and  syntactic 
analysis 


The  standard  query  expansion  stage  in  step  3  helps  us  to  find  alternatives  to  the  original  query  so  that  the 
search  engine  can  find  more  pages  that  match  the  original  query.  As  shown  above,  however,  most  of  the 
expanded  phrases  obtained  using  the  standard  method  are  not  useful. 

Thus, in this stage we eliminate many of these useless phrases using a process we refer to as “focusing.”
This  process  narrows  the  set  of  expanded  queries  down  to  the  most  useful  phrases,  using  co-occurrence 
data  (see  Table  1),  together  with  syntactic  analysis. 

The main idea in this stage is that, if the user submits a query that is not a proper noun or an acronym, the
system will try to find the phrases that represent the same concept as expressed by the user’s query. It
starts by expanding the original query by producing the cross product of the synonyms of each term in the
query. The system then selects the useful phrases by learning from the previously collected web pages
which combinations of synonyms make most sense. If the query is “fast computer,” the system should
produce additional queries like “high-speed computer,” “high-speed parallel computer,” “powerful
computer,” or “fast and powerful computer.” However, phrases like “Accelerated Computer*,” “Alacritous
computing machine*,” or “rapid growing computer company*” should not be produced.
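The selection idea can be illustrated with a small filter that keeps an expanded phrase only if its words were seen next to each other in the KWIC data. This is a sketch under stated assumptions: the function names, the crude plural handling, and the restriction to two-word phrases are illustrative choices, and the sample rows are copied from Table 1.

```python
def normalize(word):
    """Lowercase, strip punctuation, and crudely drop a final plural 's'."""
    word = word.lower().strip(".,;:")
    # Assumption: a naive singularization that is good enough for the example.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def adjacent_pairs(kwic_rows, keyword_column=3):
    """Collect (left neighbour, keyword) pairs observed in the KWIC rows."""
    return {
        (normalize(row[keyword_column - 1]), normalize(row[keyword_column]))
        for row in kwic_rows
    }


def focus(phrases, pairs):
    """Keep only the two-word phrases attested as adjacent words in the corpus."""
    return [p for p in phrases
            if tuple(normalize(w) for w in p.split()) in pairs]


# Rows 1, 6, and 16 of Table 1 (six-word view, keyword in column 4).
ROWS = [
    ["to", "the", "Cray-1", "computer", "which", "was"],
    ["answer", "sessions", "High-speed", "computers", "Quality", "reference"],
    ["of", "a", "powerful", "computer", "algorithm", "written"],
]

candidates = ["accelerated computer", "high-speed computer",
              "powerful computer", "alacritous computing machine"]
print(focus(candidates, adjacent_pairs(ROWS)))
```

A phrase that never occurs in the collected pages is dropped, whatever WordNet says about its individual words.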

To  accomplish  this  the  system  performs  two  stages: 

1) It uses the co-occurrence data to find the phrases that make most sense. In this example, the query
“fast computer” may have “high-speed computer” as its synonym but not “accelerated computer.”
This is because the phrase “accelerated computer” never appears in the KWIC file, so we take the word
“accelerated” not to make much sense in connection with “computer.” In this step, we are able to
eliminate many phrases produced by the previous stage.

2) The system performs syntactic analysis to find relevant phrases that are written differently from the
expanded queries from the previous step (this work is currently ongoing). In addition to the phrase
“high-speed computer,” for example, obtained from the original query “fast computer,” there may be
other phrases like “fast and powerful computer,” or “high-speed parallel computer.” To accomplish
this stage, we use syntactic rules for noun phrases such as:


NP ->  N                  computer
       ART N              the computer
       ADJ N              high-speed computer
       ADJ ADJ N          high-speed parallel computer
       ADJ ADJ ADJ N      fast parallel network computer
       ART ADJ N          a high-speed computer
       ART ADJ ADJ N      a fast parallel computer
       N1 N2              network computer (N1 is a noun but serves as an adjective and N2 is a main noun)
       N1 N2 N3           network IBM computer (N1, N2, and N3 are nouns but N1 and N2 serve as adjectives while N3 is a main noun)


This indicates that the synonyms of each word in the query can appear in many positions as long as they
follow one of these rules. These rules also tell us that a phrase like “rapid growing computer company*” is
about the “company,” not the “computer,” so the adjective “rapid” cannot be used to modify “computer.”
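As a rough illustration of how the noun phrase rules could be applied, the sketch below checks a tagged phrase against the rule list and identifies its main noun. It assumes the words have already been tagged as ART, ADJ, or N by some part-of-speech tagger, which the report does not specify; the function names are likewise hypothetical.

```python
# The rule list above, written as tag sequences (N1, N2, N3 are all tagged "N").
NP_RULES = {
    ("N",),
    ("ART", "N"),
    ("ADJ", "N"),
    ("ADJ", "ADJ", "N"),
    ("ADJ", "ADJ", "ADJ", "N"),
    ("ART", "ADJ", "N"),
    ("ART", "ADJ", "ADJ", "N"),
    ("N", "N"),
    ("N", "N", "N"),
}


def is_noun_phrase(tagged):
    """True when the tag sequence of `tagged` matches one of the listed rules."""
    return tuple(tag for _, tag in tagged) in NP_RULES


def head_noun(tagged):
    """The final noun is the main noun; any earlier nouns act as modifiers."""
    return tagged[-1][0] if tagged and tagged[-1][1] == "N" else None


def modifies(tagged, adjective, noun):
    """True when `adjective` occurs in a phrase whose main noun is `noun`."""
    words = [word for word, _ in tagged]
    return adjective in words[:-1] and head_noun(tagged) == noun


good = [("a", "ART"), ("fast", "ADJ"), ("parallel", "ADJ"), ("computer", "N")]
bad = [("rapid", "ADJ"), ("growing", "ADJ"), ("computer", "N"), ("company", "N")]
print(is_noun_phrase(good), modifies(good, "fast", "computer"))
print(head_noun(bad), modifies(bad, "rapid", "computer"))
```

Treating the last noun as the head is what separates “a fast parallel computer” from a phrase whose subject is the company.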

5) Submitting the results to the web to retrieve additional web pages using the expanded phrases.

This stage uses the expanded phrases as search queries to send to the search engine. The returned results are
URLs that include results from the original query and the expanded queries.
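One plausible way to combine the returned URLs is shown below. The `search` argument is a caller-supplied function that maps a query string to a list of URLs; it is a placeholder, since the report does not describe how the system talks to the search engine, and the ordering and de-duplication policy are assumptions.

```python
def merged_results(original_query, expanded_queries, search):
    """Return URLs for the original query first, then any new ones from expansions."""
    seen = set()
    merged = []
    for query in [original_query, *expanded_queries]:
        for url in search(query):
            # Report each URL once, keeping the order in which it first appeared.
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Stand-in for a real search engine call, with placeholder URLs.
FAKE_INDEX = {
    "fast computer": ["http://example.com/a", "http://example.com/b"],
    "high-speed computer": ["http://example.com/b", "http://example.com/c"],
}

print(merged_results("fast computer", ["high-speed computer"],
                     lambda q: FAKE_INDEX.get(q, [])))
```

Listing the original query's results first keeps the user's own wording at the top, with the expanded phrases only adding pages that were not already found.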


CONCLUSIONS: Our query expansion technique has two major stages: the standard expansion
phase and a new focusing phase, which selects among the expanded phrases to produce a subset of phrases
that should make sense to ordinary language users.


SIGNIFICANCE: Though we have not yet been able to perform a systematic evaluation of our approach,
our initial results show promise for improving precision and recall rates for internet search.